iT邦幫忙

12th iThome Ironman Contest (第 12 屆 iThome 鐵人賽)

DAY 1
Self-Challenge Group

From Crawler to Website (從爬蟲到架站) series, post 1

Day 0: From Crawler to Website - Getting the Data
Introduction

This series documents my learning process: building a baseball statistics lookup site, going from collecting the data all the way to deploying the website. Python is used both for the crawler and for the site itself.

Getting the data

The data comes from the official CPBL website.
However, the official site does not provide advanced metrics such as OPS+, so those will be pulled from CPBL stat instead.

The first task is the month-by-month splits, which only the official CPBL site provides. Start by observing the URL of a player's personal page:

Lin An-Ko's (林安可) page
http://www.cpbl.com.tw/player/apart.html?player_id=k429&teamno=L01&year=2020&type=05

Su Chih-Chieh's (蘇智傑) page
http://www.cpbl.com.tw/player/apart.html?player_id=H594&teamno=L01&year=2020&type=05

Comparing the two, type and year never change, so the remaining two query parameters, player_id and teamno, are what we need to obtain. Going back to the full stats listing and pressing F12, both values can be found in the <a> tags under each <td>. The last decision is how to store the data; here CSV is used.
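As a side note, the href splitting done by hand in the crawler below can also be expressed with the standard library's urllib.parse. A minimal sketch, using a sample href shaped like the player URLs above:

```python
from urllib.parse import urlparse, parse_qs

# A sample href in the same shape as the player links above
href = "/player/apart.html?player_id=k429&teamno=L01&year=2020&type=05"

# parse_qs returns each query parameter as a list of values
params = parse_qs(urlparse(href).query)
player_id = params["player_id"][0]
teamno = params["teamno"][0]

print(player_id, teamno)  # k429 L01
```

This avoids relying on the parameter order inside the query string, which manual split("&") does.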

import requests
from bs4 import BeautifulSoup
import csv

def get_all_player():
    # Batting stats listing; the page number gets appended after per_page=
    url = "http://www.cpbl.com.tw/stats/all.html?&game_type=01&&stat=pbat&year=2020&online=1&per_page="

    fieldnames = ['Name', 'ID', 'Team ID']
    with open("player_ID.csv", 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for i in range(5):
            print("Crawling " + url + str(i + 1))
            r = requests.get(url + str(i + 1))

            soup = BeautifulSoup(r.text, 'html.parser')

            # Every player link is an <a> inside a <td>
            td = soup.select("td a")

            for j in range(len(td)):
                player_id = {}
                # href looks like ...?player_id=k429&teamno=L01&...
                temp = td[j]["href"].split("?")[1].split("&")
                player_id['Name'] = td[j].text.strip()
                player_id['ID'] = temp[0].split("=")[1]
                player_id['Team ID'] = temp[1].split("=")[1]
                writer.writerow(player_id)
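The resulting file can be read back with csv.DictReader, which is exactly how the next step consumes it. A quick sketch with a made-up row in the same format player_ID.csv uses:

```python
import csv
import io

# A made-up row matching the header written by get_all_player above
sample = "Name,ID,Team ID\n林安可,k429,L01\n"

rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]['ID'], rows[0]['Team ID'])  # k429 L01
```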

With the IDs in hand, the next step is fetching each month's numbers.

Open a player's page; the monthly splits table is the target.

Pressing F12 to inspect, the stats are laid out with a <table>, and the data cells carry a class (display_a1). With the target located, parsing can begin.

player_url = "http://www.cpbl.com.tw/player/apart.html?year=2020&type=05&"
with open('player_ID.csv', 'r', encoding='utf-8') as csvfile:
    rows = csv.DictReader(csvfile)
    for row in rows:
        # Build a fresh URL per player instead of appending onto the base
        url = player_url + 'player_id=' + str(row['ID']) + '&teamno=' + str(row['Team ID'])
        res = requests.get(url)
        soup = BeautifulSoup(res.text, 'html.parser')

        # The data cells carry the display_a1 class
        player_td = soup.select(".display_a1")

        # First column of each table row holds the month label;
        # drop the first two cells, which belong to the header rows
        month = soup.select("tr > td:nth-of-type(1)")
        month = month[2:]

        team = soup.select('span')

        player_info = {}
        avg_list, OBP_list, SLG_list, PA_list = {}, {}, {}, {}
        AB_list, RBI_list, H_list, HR_list = {}, {}, {}, {}

        # Each table row spans 10 data cells; the fixed offsets
        # pick out individual stats within row i
        for i in range(len(month)):
            avg_list[month[i].text] = player_td[19 + i * 10].text
            OBP_list[month[i].text] = player_td[17 + i * 10].text
            SLG_list[month[i].text] = player_td[18 + i * 10].text
            PA_list[month[i].text] = player_td[10 + i * 10].text
            AB_list[month[i].text] = player_td[11 + i * 10].text
            RBI_list[month[i].text] = player_td[12 + i * 10].text
            H_list[month[i].text] = player_td[13 + i * 10].text
            HR_list[month[i].text] = player_td[14 + i * 10].text

        # Header cells player_td[0..9] supply the stat names used as keys
        player_info[player_td[9].text] = avg_list
        player_info[player_td[7].text] = OBP_list
        player_info[player_td[8].text] = SLG_list
        player_info[player_td[0].text] = PA_list
        player_info[player_td[1].text] = AB_list
        player_info[player_td[2].text] = RBI_list
        player_info[player_td[3].text] = H_list
        player_info[player_td[4].text] = HR_list

The data is now scraped, but before it can be plotted the month mismatch has to be handled: some players may not appear for a whole month, and a rookie may only be called up to the first team mid-season.

# attr holds the month labels in site order, one slot per month;
# the exact labels below are an assumption about the 2020 season
attr = ['四月', '五月', '六月', '七月', '八月', '九月', '十月', '十一月']

total_info = {}
for info_type in player_info:
    # Start every stat with one '0' per month, then fill the months that exist
    info = ['0', '0', '0', '0', '0', '0', '0', '0']
    for data in player_info[info_type]:
        n = attr.index(data)
        info[n] = player_info[info_type][data]
    total_info[info_type] = info
total_info['Name'] = row['Name']

That wraps up the data-collection stage, but the crawl speed still leaves room for improvement; the next post will use async requests to optimize this crawler.

